Meaning has been left outside most theoretical approaches to information in biology. Functional responses
based on an appropriate interpretation of signals have been replaced by a probabilistic description of
correlations between emitted and received symbols. This assumption leads to potential paradoxes, such as
the presence of a maximum information associated with a channel that creates completely wrong
interpretations of the signals. Game-theoretic models of language evolution and other studies considering
embodied communicating agents show that the correct (meaningful) match resulting from agent-agent
exchanges is always achieved, and natural systems obviously solve the problem correctly. Inspired by the
concept of duality of the communicative sign stated by the Swiss linguist Ferdinand de Saussure, here we
present a complete description of the minimal system necessary to measure the amount of information that
is consistently decoded. Several consequences of our developments are investigated, such as the uselessness
of a certain amount of information properly transmitted for communication among autonomous agents.

Major innovations in evolution have been associated with novelties in the ways information is coded,
modified and stored by biological structures on multiple scales [1]. Some of the major transitions involved
the emergence of complex forms of communication, human language being the most prominent and
difficult to explain [2]. The importance of information in biology has been implicitly recognized since the early
developments of molecular biology, which took place simultaneously with the rise of computer science and
information theory. Not surprisingly, many key concepts such as coding, decoding, transcription or translation
were soon incorporated as part of the lexicon of molecular biology [3].

Communication among individual cells promoted multicellularity, which required the invention and
diversification of molecular signals and their potential interpretations. Beyond genetics, novel forms of non-genetic
information propagation emerged. At a later stage, the rise of neural systems opened a novel scenario to interact
and communicate with full richness [2]. Human language stands as the most complex communication system and,
since communication deals with the creation, reception and processing of information, understanding
communication in information-theoretic terms has become a major thread in our approach to the evolution of language.

In its classical form, information theory (IT) was formulated as a way of defining how signals are sent and
received through a given channel, with no attention to their meaning. However, in all kinds of living systems, from
cells sharing information about their external medium and individuals of a given species surviving in a world full of
predators to two humans or apes exchanging signals, a crucial component beyond information is its meaningful
content [4]. The distinction is very important, since information has been treated by theoreticians since
Shannon's seminal work [5] as a class of statistical object that measures correlations among sets of symbols, whereas
meaning is inevitably tied to some sort of functional response with consequences for the fitness of the
communicating agents. This standard scheme describing information transmission through a noisy channel [5] is
summarized in figure (1a). The most familiar scenario would be described by a speaker (S) and a listener or receiver (R)
having a conversation in a living room. The air carries the voice of the first and is the channel, which would be
reliable (low or zero noise) if nothing except R and S were present. Instead, the channel will become more and
more unreliable (noisy) as different sources of perturbation interfere. These can be very diverse, from air
turbulence and children laughing to another conversation among different people. Consistently with any
standard engineering design, Shannon's picture allows us to define efficient communication in terms somewhat similar
to those used, for example, within electric transmission networks. In this case, a goal of the system design is
minimizing the heat loss during the transmission process. Information is a (physically) less obvious quantity,
but the approach taken by standard IT is quite the same.

Figure 1 | In the standard theory of information, as defined by Shannon,
a communication system (a) is described in terms of a sequential
chain of steps connecting a source of messages (S) and a final receiver (R).
The source can be considered linked to some external repertoire of objects
($\Omega$). An encoder and a decoder participate in the process and are tied
through a channel $\Lambda$, subject to noise. The acquisition and evolution of a
language, as it happens in artificial systems of interacting agents, like robots
(b), involves some additional aspects that are usually ignored in the
original formulation of Shannon's approach. Those include the
embodiment of agents and the necessary consistency in their
communicative exchanges emerging from their perceptions of the
shared, external world. Picture courtesy of Luc Steels.

As a consequence of its statistical formulation, IT does not take
into account "meaning" or "purpose" which, as noted by Peter
Schuster [1], are also difficult notions for evolutionary biology.
Despite this limitation, it has been shown to work successfully in
the analysis of correlations in biology [6]. However, one undesirable
consequence of this approach is that some paradoxical situations
can emerge that contradict our practical intuition. An example is
that a given pair of signals $s_1, s_2$ associated with two given objects or
events from the external world could be "interpreted" by the receiver
of the messages in a completely wrong way: "fire" and "water", for
example, could be understood as "water" and "fire", respectively.
Measured from standard IT (see below), the information exchanged
is optimal, even perfect, if "fire" ("water") is always interpreted as
"water" ("fire"). In other words, full miscommunication can also
score high, as perfectly "efficient", within Shannon's framework.



Therefore, one should approach the communicative sign as a dual
entity that must be preserved as a whole in the communicative
exchange. This crucial duality of the sign in communicative exchanges
was already pointed out (with some conceptual differences from the
version we will develop below) before the birth of information theory
by the Swiss linguist Ferdinand de Saussure in his acclaimed Cours de
linguistique générale [7].

It seems obvious that meaning (and its connection to some signal,
in order to create the dual entity) plays an essential role and has been
shaped through evolution: "the message, the machinery processing
the message and the context in which the message is evaluated are
generated simultaneously in a process of coevolution" [1]. In our
bodies, proper recognition of invaders is essential to survival, and
failures to recognize the self and the non-self are at the core of
many immune diseases [8, 9]. Similarly, learning processes associated with
proper identification of predators and how to differentiate them
from inmates are tied to meaningful information. Beyond the specific
details associated with each system, the relevance of correct information
storing and sharing, and of meaning, is well illustrated by its impact
on evolutionary dynamics. As pointed out in [3] we can say that, in
biology, the coder is natural selection. In this way, the use of
evolutionary game-theoretic arguments has played a very important role in
shaping evolutionary approaches to language and communication [10-15],
but requires some extension in order to properly account
for meaningful information. Moreover, evolutionary robotics and
the artificial evolution of protolanguages and proto-grammars is a
unique scenario where such a framework naturally fits [16-22]. Evolving
robots capable of developing simple communication skills are able to
acquire a repertoire of appropriate signals, share them and
correctly interpret the signals sent by other agents. The coherent
development of a shared set of symbols that is correctly used (and thus where
"meaning" is preserved) becomes central. Such coherence results
from the combination of a shared repertoire of signals together with
a shared perception of the external world, as detected and perceived
by the same class of sensing devices.

In this paper we develop and describe an information-theoretic
minimal system in which the signal is linked to a referential value.
This relation is assumed to be simple and direct, so that no other
process than the mapping is assumed. Other forms of more complex
meaning associations would deviate from the spirit of the paper,
which is to introduce the minimum framework accounting for the
conservation of the simplest form of meaning. In a nutshell, we are
going to derive an information-theoretic measure able to grasp the
consistency of the shared information between agents, when
meaning is introduced as a primitive referential value attributed to one or
more signals.

Results 

We start this section by describing the minimal system incorporating
referential values for the sent signals. Within this system, we show
what is meant when we say that information theory is blind to any
meaning of the message. We then derive the amount of consistently
decoded information between two given agents exchanging
information about their shared world, thereby fixing the problem pointed out
above, and analyze some of its most salient properties, including the
complete description of the binary symmetric channel within this
new framework.

The minimal system encompassing referentiality. Our minimal
system to study the referential or semantic consistency of a given
information exchange will involve two autonomous communicative
agents, $A$ and $B$, a channel, $\Lambda$, and a shared world, $\Omega$. Agents exchange
information about their shared world through the channel (see
figure 2). We now proceed to describe it in detail.

Figure 2 | Minimal communicative system to study the conservation of
referentiality (a): A shared world, whose events are the members of the set
$\Omega$ and whose behavior is governed by the random variable $X_\Omega$. A coding
engine, $P^A$, which performs a mapping between $\Omega$ and the set of signals $\mathcal{S}$,
$X_s$ being the random variable describing the behavior of the set of signals
obtained after coding. The channel, $\Lambda$, may be noisy and, thus, the input of
the decoding device, $Q^B$, depicted by $X'_s$, might be different from $X_s$. $Q^B$
performs a mapping between $\mathcal{S}$ and $\Omega$, whose output is described by $X'_\Omega$.
Whereas mutual information provides a measure of the relevance of the
correlations between $X_\Omega$ and $X'_\Omega$, consistent information evaluates the
relevance of the information provided by consistent pairs with regard to
the overall amount of information. In this context, from a classical
information-theoretical point of view, situations like (b) and (c) could be
indistinguishable. By defining the so-called consistent information we can
properly differentiate (b) and (c) by evaluating the degree of consistency of
input/output pairs (see text).

Description. An agent, $A$, is defined as a pair of computing devices,

$$ A = \{P^A, Q^A\}, \tag{1} $$

where $P^A$ is the coder module and $Q^A$ is the decoder module. The
shared world is defined by a random variable $X_\Omega$, which takes values
from the set of events $\Omega = \{m_1, \ldots, m_n\}$; we denote by $p(m_k)$ the (always
non-zero) probability associated with any event $m_k \in \Omega$. The coder
module, $P^A$, is described by a mapping from $\Omega$ to the set of signals
$\mathcal{S} = \{s_1, \ldots, s_n\}$. We will here assume $|\Omega| = |\mathcal{S}| = n$, unless the
contrary is indicated. The mapping that represents the coder module is
defined by means of a matrix of conditional probabilities $P^A$, whose
elements $P^A_{ij} = P_A(s_j \mid m_i)$ satisfy the normalization conditions
(namely, for all $m_i \in \Omega$, $\sum_{j \le n} P^A_{ij} = 1$). The outcome of the coding
process is depicted by the random variable $X_s$, taking values from $\mathcal{S}$
according to a probability distribution $q$:

$$ q(s_i) = \sum_{k \le n} p(m_k) P^A_{ki}. \tag{2} $$

The channel $\Lambda$ is characterized by the $n \times n$ matrix of conditional
probabilities $\Lambda$, with matrix elements $\Lambda_{ij} = P_\Lambda(s_j \mid s_i)$. The random
variable $X'_s$ describes the output of the composite system world +
coder + channel, thereby taking values on the set $\mathcal{S}$, and follows the
probability distribution $q'$, defined as

$$ q'(s_i) = \sum_{k \le n} p(m_k) \sum_{j \le n} P^A_{kj} \Lambda_{ji}. \tag{3} $$

Finally, the decoder module is a computational device described by a
mapping from $\mathcal{S}$ to $\Omega$; i.e., it receives $\mathcal{S}$ as the input set, emitted by
another agent through the channel, and yields as output elements of
the set $\Omega$. $Q^A$ is completely defined by its transition probabilities,
namely, $Q^A_{ik} = P_A(m_k \mid s_i)$, which satisfy the normalization conditions
(i.e., for all $s_i \in \mathcal{S}$, $\sum_{k \le n} Q^A_{ik} = 1$). We emphasize the assumption that,
in a given agent $A$, following [14, 15] (but not [10, 11]), there is a
priori no correlation between $P^A$ and $Q^A$.

Now suppose that we want to study the information transfer
between two agents sharing the world. Let us consider $A$ the encoder
agent and $B$ the decoder one, although we emphasize that both agents
can perform both tasks. Agent $B$ tries to reconstruct $X_\Omega$ from the
information received from $A$. The description of $\Omega$ made by agent $B$
is depicted by the random variable $X'_\Omega$, taking values on the set $\Omega$ and
following the probability distribution $p'$, which takes the form:

$$ p'(m_i) = \sum_{l \le n} p(m_l) P_{AB}(m_i \mid m_l), \tag{4} $$

where

$$ P_{AB}(m_i \mid m_l) = \sum_{j, r \le n} P^A_{lj} \Lambda_{jr} Q^B_{ri}. \tag{5} $$

From this we can naturally derive the joint probabilities,
$P_{AB}(m_i, m_j)$, as follows:

$$ P_{AB}(m_i, m_j) = \sum_{l, r \le n} p(m_j) P^A_{jl} \Lambda_{lr} Q^B_{ri}. \tag{6} $$

We say that $X'_\Omega$ is the reconstruction of the shared world, $X_\Omega$, made by
agent $B$ from the collection of messages sent by $A$. Summarizing, we
thus have a composite system where the behavior at every step is
described by a random variable, from the description of the world,
$X_\Omega$, to its reconstruction, $X'_\Omega$ (see figure 2a):

$$ \Omega \xrightarrow{X_\Omega \sim p} A \xrightarrow{X_s \sim q} \Lambda \xrightarrow{X'_s \sim q'} B \xrightarrow{X'_\Omega \sim p'} \Omega. \tag{7} $$

At this point, it is convenient to introduce, for the sake of clarity,
some new notation. We will define two matrices, namely $J(AB)$ and
$\Lambda(AB)$, in such a way that $J_{ij}(AB) = P_{AB}(m_j, m_i)$ and
$\Lambda_{ij}(AB) = P_{AB}(m_j \mid m_i)$, so that the first index always labels the event at
the input and the second index the event at the output. Finally, we will define
the probability distribution $\Lambda_i(AB) = \{\Lambda_{i1}(AB), \ldots, \Lambda_{in}(AB)\}$. This new notation will
enable us to manage formulas in a more compact way.
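
As an illustration of the construction above, the following minimal sketch builds the distributions and matrices of equations (2) to (6) for a three-event world. All numerical values (the distribution `p`, the coder `P_A`, the channel `channel` and the decoder `Q_B`) are arbitrary assumptions chosen for the example, and the helper `matmul` is hypothetical. Rows index the input and columns the output, so `cond[i][j]` stands for the probability that event m_i is reconstructed as m_j.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists (hypothetical helper)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

n = 3
# Assumed example values: none of these numbers come from the text.
p = [0.5, 0.3, 0.2]                 # p(m_i): distribution of the shared world
P_A = [[0.9, 0.1, 0.0],             # coder of A: P_A[i][j] = P(s_j | m_i)
       [0.0, 0.8, 0.2],
       [0.1, 0.0, 0.9]]
channel = [[0.95, 0.05, 0.00],      # channel[i][j] = P(s_j received | s_i sent)
           [0.00, 0.95, 0.05],
           [0.05, 0.00, 0.95]]
Q_B = [[0.9, 0.1, 0.0],             # decoder of B: Q_B[i][k] = P(m_k | s_i)
       [0.1, 0.9, 0.0],
       [0.0, 0.1, 0.9]]

# Distribution q of the coded signals, as in equation (2).
q = [sum(p[k] * P_A[k][i] for k in range(n)) for i in range(n)]

# Distribution q' of the signals after the channel, as in equation (3).
sent = matmul(P_A, channel)
q_prime = [sum(p[k] * sent[k][i] for k in range(n)) for i in range(n)]

# Composite matrix: cond[i][j] = probability that m_i is decoded as m_j.
cond = matmul(sent, Q_B)

# Joint matrix: joint[i][j] = p(m_i) * cond[i][j].
joint = [[p[i] * cond[i][j] for j in range(n)] for i in range(n)]

# Distribution p' of the reconstructed world, as in equation (4).
p_prime = [sum(joint[i][j] for i in range(n)) for j in range(n)]

print("q  =", [round(x, 4) for x in q])
print("q' =", [round(x, 4) for x in q_prime])
print("p' =", [round(x, 4) for x in p_prime])
```

Storing every conditional matrix with rows as inputs lets the whole coding, transmission and decoding chain be written as a single matrix product.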

Information-theoretic aspects of this minimal system. First we shall
explore the behaviour of mutual information in this system. Detailed
definitions of the information-theoretic functionals used in this subsection
are provided in the Methods section. Under the framework described above,
we have two relevant random variables: the world $X_\Omega$
and the reconstruction of the world $X'_\Omega$. Their mutual information
$I(X_\Omega : X'_\Omega)$ is defined as [5, 23, 24]:

$$ I(X_\Omega : X'_\Omega) = H(X_\Omega) - H(X_\Omega \mid X'_\Omega). \tag{8} $$

The above expression has an equivalent formulation, namely

$$ I(X_\Omega : X'_\Omega) = \sum_{i, j \le n} J_{ij}(AB) \log \frac{J_{ij}(AB)}{p(m_i)\, p'(m_j)}, \tag{9} $$

where the right side of the above equation can be identified as the
Kullback-Leibler divergence between distributions $J(AB)$ and $p \cdot p'$:

$$ I(X_\Omega : X'_\Omega) = D(J(AB) \,\|\, p \cdot p'). \tag{10} $$

Within this formulation, the mutual information is the amount of
accessory bits needed to describe the composite system $X_\Omega, X'_\Omega$ taking
as the reference the distribution $p \cdot p'$, which supposes no correlation
between $X_\Omega$ and $X'_\Omega$.

Let us underline a feature of mutual information which is relevant
for our purposes. As is well known, $\max I(X_\Omega : X'_\Omega) \le H(X_\Omega)$, and
equality holds if there is no ambiguity in the information
processing, meaning that the process is reversible, in logical terms. Thus,
every event $m_i \in \Omega$ has to be decoded with probability 1 to some
event $m_j \in \Omega$ which, in turn, must not be the result of the
coding/decoding process of any other event. In mathematical terms, this
means that $P^A, Q^B, \Lambda \in \Pi_{n \times n}$, $\Pi_{n \times n}$ being the set of $n \times n$
permutation matrices, which are the matrices in which every row and
column contains $n - 1$ elements equal to 0 and one element equal to
1 (see Methods section). It is worth emphasizing that $\delta_{n \times n}$, the $n \times n$
identity matrix, is itself a permutation matrix. Notice that if $\Lambda(AB) \neq \delta_{n \times n}$,
some symbol $m_i$ sent by the source is decoded as a different element
$m_j$. This shift has no impact on the information measure
$I(X_\Omega : X'_\Omega)$ if $\Lambda(AB) \in \Pi_{n \times n}$, and this is one of the reasons why
it is claimed that the content of the message is not taken into account
in the standard information measure. Actually, it is straightforward
to show (see Appendix B) that only $n!$ out of the $(n!)^3$ configurations
leading to the maximum mutual information also lead to a fully
consistent reconstruction, i.e., a reconstruction where referential
value is conserved. This mathematically shows that, for autonomous
agents exchanging messages, mutual information is a weak indicator
of communicative success.
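
The blindness of mutual information to such shifts can be checked numerically. The sketch below is an illustrative assumption rather than part of the original derivation: it takes a uniform three-event world and compares a composite matrix equal to the identity with one equal to a cyclic permutation, in which every event is decoded with certainty but always as the wrong event. The helper names are hypothetical.

```python
from math import log2

def mutual_information(joint):
    """Mutual information in bits of a joint distribution (nested lists)."""
    p_in = [sum(row) for row in joint]            # marginal of the input
    p_out = [sum(col) for col in zip(*joint)]     # marginal of the output
    return sum(
        value * log2(value / (p_in[i] * p_out[j]))
        for i, row in enumerate(joint)
        for j, value in enumerate(row)
        if value > 0
    )

n = 3
p = [1.0 / n] * n   # assumed uniform world

# Composite matrices: identity (faithful) and cyclic shift (always wrong).
identity = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
shift = [[1.0 if j == (i + 1) % n else 0.0 for j in range(n)] for i in range(n)]

def joint_from(cond):
    """Joint matrix with entries p(m_i) * cond[i][j]."""
    return [[p[i] * cond[i][j] for j in range(n)] for i in range(n)]

mi_identity = mutual_information(joint_from(identity))
mi_shift = mutual_information(joint_from(shift))

# Probability that an event is reconstructed as itself (trace of the joint).
agree_identity = sum(joint_from(identity)[i][i] for i in range(n))
agree_shift = sum(joint_from(shift)[i][i] for i in range(n))

print("mutual information:", mi_identity, mi_shift)
print("consistent pairs  :", agree_identity, agree_shift)
```

Both configurations reach the maximum mutual information, log2(3) bits, while only the identity reconstructs any event consistently.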

Derivation of consistent information. Now we have a complete 
description of the minimal system able to encompass referential 
values for the sent signals. It is the objective of this section to 
derive an information-theoretic measure, different from mutual 
information, that will allow us to evaluate the amount of 
consistently decoded information. 

Preliminaries. The rawest evaluation of the amount of consistently
decoded pairs is found by averaging the probability of having a
consistent coding/decoding process during an information exchange
between agent $A$ and agent $B$. This corresponds to the view of an
external observer simply counting events and taking into account
only whether they are consistently decoded or not. This probability,
denoted as $\theta_{AB}$, is obtained by summing the probabilities of having a
consistent input/output pair, i.e.:

$$ \theta_{AB} = \mathrm{tr}\, J(AB) = \sum_{i \le n} J_{ii}(AB). \tag{11} $$

This formula has been widely used as a communicative payoff for an
evolutionary dynamics in which consistent communication has a
selective advantage [11, 14, 15]. We observe that the probability of error
$p_e(AB)$ in this scenario is given by $p_e(AB) = 1 - \theta_{AB}$. Therefore,
thanks to Fano's inequality (see Methods section), we can relate this
parameter to the information-theoretic functionals involved in the
description of this problem, namely:

$$ \theta_{AB} \le 1 - \frac{H(X_\Omega \mid X'_\Omega) - 1}{\log(n - 1)}. \tag{12} $$



From this parameter, we can build another, slightly more elaborate
functional. We are still under the viewpoint of the external observer,
who is now interested in the fraction of information needed to
describe the composite system $X_\Omega, X'_\Omega$ that comes from consistent
input/output pairs when information is sent from $A$ to $B$. This
fraction, to be named $\sigma_{AB}$, is:

$$ \sigma_{AB} = -\frac{\sum_{i \le n} J_{ii}(AB) \log J_{ii}(AB)}{H(X_\Omega, X'_\Omega)}. \tag{13} $$

We observe that the above quantity is symmetrical in relation to $X_\Omega$
and $X'_\Omega$. These two estimators provide global indicators of the
consistency of the information exchange.
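
The two global estimators can be computed directly from the joint matrix. The sketch below assumes an arbitrary three-event example (the values of `p` and `cond` are not taken from the text), evaluates the probability of a consistent pair, the bound given by Fano's inequality in the form 1 - (H(X|X') - 1) / log(n - 1), and the fraction of the joint entropy carried by consistent pairs. Logarithms are taken in base 2.

```python
from math import log2

# Assumed example: p(m_i) and cond[i][j] = P(m_i is decoded as m_j).
p = [0.5, 0.3, 0.2]
cond = [[0.8, 0.1, 0.1],
        [0.1, 0.8, 0.1],
        [0.2, 0.1, 0.7]]
n = len(p)

joint = [[p[i] * cond[i][j] for j in range(n)] for i in range(n)]
p_prime = [sum(joint[i][j] for i in range(n)) for j in range(n)]

# Probability of a consistent input/output pair: trace of the joint matrix.
theta = sum(joint[i][i] for i in range(n))

# Conditional entropy of the world given its reconstruction.
h_cond = -sum(
    joint[i][j] * log2(joint[i][j] / p_prime[j])
    for i in range(n) for j in range(n) if joint[i][j] > 0
)

# Upper bound on theta from Fano's inequality.
fano_bound = 1 - (h_cond - 1) / log2(n - 1)

# Fraction of the joint entropy that comes from the diagonal terms.
h_joint = -sum(
    joint[i][j] * log2(joint[i][j])
    for i in range(n) for j in range(n) if joint[i][j] > 0
)
sigma = -sum(joint[i][i] * log2(joint[i][i])
             for i in range(n) if joint[i][i] > 0) / h_joint

print("theta =", theta)
print("Fano bound =", fano_bound)
print("sigma =", sigma)
```

For a small alphabet the Fano bound is loose, as it is here, and it becomes informative only when the conditional entropy is large compared with one bit.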

Consistent information. However, we can go further and ask how
much of the information from the environment is consistently decoded
by agent $B$ when receiving data from $A$. As a first step, we observe that,
since $J_{ij}(AB) = p(m_i)\Lambda_{ij}(AB)$, we can rewrite equation (9) as:

$$ I(X_\Omega : X'_\Omega) = \sum_{i \le n} p(m_i) \sum_{j \le n} \Lambda_{ij}(AB) \log \frac{\Lambda_{ij}(AB)}{p'(m_j)} = \sum_{i \le n} p(m_i)\, D(\Lambda_i(AB) \,\|\, p'). \tag{14} $$

Knowing that $D(\Lambda_i(AB) \,\|\, p')$ is the information gain associated with
element $m_i$, $p(m_i) D(\Lambda_i(AB) \,\|\, p')$ is its weighted contribution to the
overall information measure. If we are interested in the amount of
this information that is consistently referentiated, we have to add an
"extra" weight to $p(m_i)$, namely $\Lambda_{ii}(AB)$, which is the probability of
having $m_i$ both at the input of the coding process and at the output.
Thus, since

$$ \Lambda_{ii}(AB)\, p(m_i)\, D(\Lambda_i(AB) \,\|\, p') = J_{ii}(AB)\, D(\Lambda_i(AB) \,\|\, p'), \tag{15} $$

the amount of consistent information conveyed from agent $A$ to agent
$B$, $\mathcal{I}(AB)$, will be:

$$ \mathcal{I}(AB) = \sum_{i \le n} J_{ii}(AB)\, D(\Lambda_i(AB) \,\|\, p'). \tag{16} $$

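Since this is the most important equation of the text, we rewrite it
using standard probability notation:

$$ \mathcal{I}(AB) = \sum_{i \le n} P_{AB}(m_i, m_i) \sum_{j \le n} P_{AB}(m_j \mid m_i) \log \frac{P_{AB}(m_j \mid m_i)}{p'(m_j)}. \tag{17} $$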

We observe that the dissipation of consistent information is due to
both standard noise, $H(X_\Omega \mid X'_\Omega)$, and another term, which is
subtracted from $I(X_\Omega : X'_\Omega)$, accounting for the loss of referentiality.
Using equations (8), (9) and (16) we can isolate this new source of
information dissipation, the referential noise, $\nu(AB)$, leading to:

$$ \nu(AB) = \sum_{i \le n} D(\Lambda_i(AB) \,\|\, p') \left( p(m_i) - J_{ii}(AB) \right). \tag{18} $$

Therefore, the total loss of referential information or total noise will
be described as

$$ \eta(AB) = H(X_\Omega \mid X'_\Omega) + \nu(AB). \tag{19} $$

The above expression enables us to rewrite equation (16) as:

$$ \mathcal{I}(AB) = H(X_\Omega) - \eta(AB), \tag{20} $$

which mimics the classical Shannon information, now with a more
restrictive noise term. Interestingly, the above expression is not
symmetrical: the presented formalism distinguishes the world, $X_\Omega$, from
its reconstruction, $X'_\Omega$. If we take into account that, according to the
definition we provided for an autonomous communicating agent, the
information can flow in both directions ($A \to B$ and $B \to A$), we can
compute the average success of the communicative exchange
between $A$ and $B$, $\mathcal{I}(A : B)$, as:

$$ \mathcal{I}(A : B) = H(X_\Omega) - \frac{1}{2}\left( \eta(AB) + \eta(BA) \right). \tag{21} $$

$\mathcal{I}(A : B)$ is the consistent information about the world $\Omega$ shared by
agents $A$ and $B$. In contrast to the previous one, the above expression
is now symmetrical, $\mathcal{I}(A : B) = \mathcal{I}(B : A)$, because both agents share
the same world, represented by $X_\Omega$. We remark that this is an
information-theoretic functional between two communicating
agents; it is not an information measure between two random
variables, as mutual information is. This equation quantifies the
communication success between two minimal communicating agents $A$,
$B$ transmitting messages about a shared world.
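
The quantities of this subsection can be sketched for a pair of agents as follows. The two-event world, the coder and decoder matrices of both agents and the channel are assumed values chosen only for illustration, and the function `exchange` is a hypothetical helper. It evaluates the consistent information as the sum over i of J_ii(AB) D(Λ_i(AB)||p'), the referential noise and the total noise, so that the identity "consistent information = H(X_Ω) - total noise" can be checked numerically.

```python
from math import log2

def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(x * log2(x) for x in dist if x > 0)

def divergence(a, b):
    """Kullback-Leibler divergence D(a || b) in bits."""
    return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

def exchange(p, coder, channel, decoder):
    """Information functionals for messages sent from coder to decoder."""
    n = len(p)
    cond = matmul(matmul(coder, channel), decoder)    # composite matrix
    joint = [[p[i] * cond[i][j] for j in range(n)] for i in range(n)]
    p_prime = [sum(joint[i][j] for i in range(n)) for j in range(n)]
    gain = [divergence(cond[i], p_prime) for i in range(n)]
    mutual = sum(p[i] * gain[i] for i in range(n))
    consistent = sum(joint[i][i] * gain[i] for i in range(n))
    referential_noise = sum((p[i] - joint[i][i]) * gain[i] for i in range(n))
    total_noise = (entropy(p) - mutual) + referential_noise
    return {"mutual": mutual, "consistent": consistent,
            "referential_noise": referential_noise, "total_noise": total_noise}

# Assumed example values for the world, both agents and the channel.
p = [0.6, 0.4]
P_A, Q_A = [[0.9, 0.1], [0.2, 0.8]], [[0.85, 0.15], [0.1, 0.9]]
P_B, Q_B = [[0.95, 0.05], [0.1, 0.9]], [[0.9, 0.1], [0.05, 0.95]]
channel = [[0.98, 0.02], [0.02, 0.98]]

ab = exchange(p, P_A, channel, Q_B)   # A codes, B decodes
ba = exchange(p, P_B, channel, Q_A)   # B codes, A decodes

# Symmetric functional: average success of the exchange between A and B.
shared = entropy(p) - 0.5 * (ab["total_noise"] + ba["total_noise"])

print(ab)
print(ba)
print("shared consistent information:", shared)
```

Because the total noise is defined per direction, the symmetric functional is simply the mean of the two directed consistent informations.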

Properties. In this section we draw several important consequences 
from the treatment just presented, based on the consistent 
information concept. The rigorous and complete proofs behind 
them can be found in the Methods section, together with a brief 
discussion about the actual consistency of this measure when 
applied to single agents in a population (i.e., the 'self-consistency' 
or coherence that an individual agent should also keep about the 
world). 

The binary symmetric channel. We first consider the simplest case,
from which we can easily extract analytical conclusions that help us
gain intuition: the Binary Symmetric Channel with uniform input
probabilities. We are concerned with a world $\Omega$ having two events
such that $p(1) = p(2) = 1/2$, two agents $A$ and $B$ sharing information
about this world, and a binary channel, $\Lambda$. The configuration of the
agents and the channel is assumed to be of the following form:

$$ \Lambda(AB) = \begin{pmatrix} 1 - \epsilon & \epsilon \\ \epsilon & 1 - \epsilon \end{pmatrix}, \tag{22} $$

$\Lambda(AB) = P^A \Lambda Q^B$ being as defined at the beginning of the Results
section. We will refer to $\epsilon$ as the referential shift, which is the
probability that a given event is wrongly decoded in the reconstruction of
$\Omega$. In this minimal system all functionals can be easily evaluated.
First, we have that $I(X_\Omega : X'_\Omega) = 1 - H(\epsilon)$, and that $\theta_{AB} = 1 - \epsilon$,
$H(\epsilon)$ being the entropy of a Bernoulli process having parameter $\epsilon$ (see
Methods section). This leads to the following expression of the
consistent information:



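$$ \mathcal{I}(AB) = \theta_{AB}\,(1 - H(\epsilon)) = \theta_{AB}\, I(X_\Omega : X'_\Omega). \tag{23} $$

We can also easily compute $\sigma_{AB}$:

$$ \sigma_{AB} = \theta_{AB}\, \frac{1 - \log \theta_{AB}}{1 + H(\epsilon)}. \tag{24} $$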

The behavior of consistently decoded information is shown in
figure (3). In these plots we compare the behavior of $I(X_\Omega : X'_\Omega)$,
$H(X_\Omega, X'_\Omega)$ and $H(X_\Omega \mid X'_\Omega)$ with their analogous counterparts when
referentiality is taken into account, namely $\mathcal{I}(AB)$, $\sigma_{AB}$ and
$\nu(AB)$ (and $\eta(AB)$), respectively. We can observe the symmetric
behavior of the first ones against $\epsilon$, which highlights the total
insensitivity of these classical measures to the conservation of referentiality.
Instead, we observe that $\mathcal{I}(AB)$, $\sigma_{AB}$, $\eta(AB)$ and $\nu(AB)$ do reflect
the loss of referentiality conservation, showing a non-symmetric
behavior with a generally decreasing trend as referentiality is
progressively lost.
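
The contrast between the two kinds of measures can be reproduced with a few lines of code. The sketch below assumes the binary symmetric channel with uniform inputs described above, for which the mutual information is 1 - H(ε) and the consistent information is (1 - ε)(1 - H(ε)), and scans a grid of values of ε. The function names and the grid resolution are illustrative choices.

```python
from math import log2

def binary_entropy(eps):
    """Entropy in bits of a Bernoulli process with parameter eps."""
    if eps in (0.0, 1.0):
        return 0.0
    return -eps * log2(eps) - (1 - eps) * log2(1 - eps)

def mutual_information(eps):
    """Mutual information of the channel with uniform inputs."""
    return 1.0 - binary_entropy(eps)

def consistent_information(eps):
    """Mutual information weighted by the probability of consistent decoding."""
    return (1.0 - eps) * mutual_information(eps)

grid = [k / 1000 for k in range(1001)]

# Mutual information is symmetric around eps = 1/2 ...
print(mutual_information(0.1), mutual_information(0.9))
# ... whereas consistent information is not.
print(consistent_information(0.1), consistent_information(0.9))

# Location of the local maximum of consistent information for eps > 1/2.
upper_half = [eps for eps in grid if eps > 0.5]
eps_peak = max(upper_half, key=consistent_information)
print("local maximum near eps =", eps_peak)
```

The scan places the secondary maximum of the consistent information between ε = 0.8 and ε = 0.9, where the channel is again highly predictable but a small fraction of events is still decoded consistently.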

Figure 3 | The binary symmetric channel when we enrich the
communication system with a referential set shared by coder and decoder
agent. Plots correspond to the different values of the binary symmetric
channel along $\epsilon$, the referential shift parameter, from $\epsilon = 0$ (total
information with no loss of referentiality) to $\epsilon = 1$ (total information with
total loss of referentiality). On the left, from top to bottom, we have the
classical, well-known plots of $I(X_\Omega : X'_\Omega)$, $H(X_\Omega, X'_\Omega)$ (normalized to 1)
and $H(X_\Omega \mid X'_\Omega)$. On the right, we have the equivalent ones accounting for
the referentiality conservation, namely, on top, $\mathcal{I}(AB)$, next, $\sigma_{AB}$ and in
the last plot, we have $\eta(AB)$ (black line) and $\nu(AB)$ (red line). Units are
given in bits. We observe that both $I(X_\Omega : X'_\Omega)$ (and $H(X_\Omega \mid X'_\Omega)$) have a
symmetric behavior, with a minimum (maximum) at $\epsilon = 1/2$ (total
uncertainty). On the contrary, $\mathcal{I}(AB)$ does not show a symmetric
behavior, showing two minima, at $\epsilon = 1/2$ and at $\epsilon = 1$. There is a local
maximum at about $\epsilon \approx 0.85$, which is a by-product of the combination of the
loss of uncertainty of the system and a small but non-vanishing degree of
referentiality conservation.

Decrease of information due to referential losses. One interesting
consequence of equation (23) is that, except for very restricted
situations, the presence of noise has a negative impact on the value of the
consistent information, leading to the general conclusion that:

$$ \mathcal{I}(AB) \le I(X_\Omega : X'_\Omega). \tag{25} $$



This latter inequality shows that, in most cases, in the absence of a
designer, part of the information properly transmitted is actually
useless for communication in a framework of autonomous agents.
As demonstrated in the Methods section, the strict inequality holds
in general. Indeed, the above relation becomes an equality only in the
very special case where there is a perfect matching between the two
agents (i.e., $\Lambda(AB) = \delta_{n \times n}$, $\delta_{n \times n}$ being the $n \times n$ identity matrix) or,
trivially, in the case where $I(X_\Omega : X'_\Omega) = 0$.
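
A quick numerical check of this inequality is sketched below. It is not a proof and is not taken from the Methods section: it draws random worlds and random composite matrices (the sample size, the alphabet size and the seed are arbitrary assumptions) and records the gap between mutual information and consistent information, which should never be negative. The helper names are hypothetical.

```python
import random
from math import log2

def random_distribution(n, rng):
    """Random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def informations(p, cond):
    """Return (mutual information, consistent information) in bits."""
    n = len(p)
    joint = [[p[i] * cond[i][j] for j in range(n)] for i in range(n)]
    p_prime = [sum(joint[i][j] for i in range(n)) for j in range(n)]
    gain = [
        sum(cond[i][j] * log2(cond[i][j] / p_prime[j])
            for j in range(n) if cond[i][j] > 0)
        for i in range(n)
    ]
    mutual = sum(p[i] * gain[i] for i in range(n))
    consistent = sum(joint[i][i] * gain[i] for i in range(n))
    return mutual, consistent

rng = random.Random(0)   # fixed seed so that the run is reproducible
n = 4
gaps = []
for _ in range(1000):
    p = random_distribution(n, rng)
    cond = [random_distribution(n, rng) for _ in range(n)]
    mutual, consistent = informations(p, cond)
    gaps.append(mutual - consistent)

# Perfect matching between the agents: the two measures coincide.
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
mutual_id, consistent_id = informations([1.0 / n] * n, identity)

print("smallest gap over random samples:", min(gaps))
print("identity case:", mutual_id, consistent_id)
```

The gap vanishes for the identity matrix and is strictly positive in every random sample, in line with the statement that equality is reached only in very special configurations.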

But we can go further. Let us consider that we know that the
system displays a given value of $I(X_\Omega : X'_\Omega)$ and, by assumption,
we also know $H(X_\Omega)$. In these conditions, one can easily derive
$H(X_\Omega \mid X'_\Omega)$ by simply computing $H(X_\Omega) - I(X_\Omega : X'_\Omega)$. But it is
possible to set a bound on the value of $\mathcal{I}(AB)$ as well. As in many
problems of information theory, the general case is hard, even
impossible, to deal with. However, several approaches become viable
in special but illustrative cases. Let us assume the paradigmatic
configuration in which $(\forall m_i \in \Omega)\; p(m_i) = 1/n$ and where $\Lambda(AB)$
acts as a symmetric channel. In this case, we have that
$\mathcal{I}(AB) \le \theta_{AB}\, I(X_\Omega : X'_\Omega)$, where



H(X Q ) ' 



and, therefore: 



(26) 



(27) 



(See the Methods section for the details of the above derivations). 
This tells us, after some algebra, that in this framework, 



f /(AB)>2H(X n |X, 



M H 2 (X n \X' n 



H(Xa) 



+ 1. 



(28) 



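Therefore, for $H(X_\Omega) \gg H(X_\Omega \mid X'_\Omega)$, we have that
$\eta(AB) > 2H(X_\Omega \mid X'_\Omega)$, leading to

$$ \mathcal{I}(AB) < H(X_\Omega) - 2H(X_\Omega \mid X'_\Omega) \tag{29} $$

and, for example, for the case in which $H(X_\Omega) \approx 2H(X_\Omega \mid X'_\Omega)$ we have
that:

$$ \mathcal{I}(AB) < H(X_\Omega) - \frac{3}{2} H(X_\Omega \mid X'_\Omega) = \frac{1}{2} I(X_\Omega : X'_\Omega). \tag{30} $$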



The above examples enable us to illustrate the strong impact of noise 
on the conservation of the referential value within a communication 
exchange -stronger than the one predicted by standard noise. 

Discussion 

Shannon's information theory had a great, almost immediate impact
in all sorts of areas, from engineering and genetics to psychology or
language studies [25]. It also influenced the work of physicists,
particularly those exploring the foundations of thermodynamics, who found
that the entropy defined by Shannon provided powerful connections
with statistical mechanics, particularly in terms of correlations. It is
mainly at that level (i.e., the existence of correlations among different
subsystems of a given system) that the use of information theory has
proved useful. But correlations do not ensure a crucial type of
coherence that seems necessary when dealing with meaningful
communication: the preservation of referentiality.

In this paper we have addressed an especially relevant problem,
namely the development of an information-theoretic framework able
to preserve meaning. This is a first step towards a more general goal,
which would involve establishing the basis for an evolutionary theory
of language change including referentiality as an explicit component.
We have shown that, if consistent information is considered, its value
is significantly lower than mutual information in noisy scenarios. We
have derived an analytical form of consistent information, which
includes referential noise along with the standard noise term. Our
information measure defines a non-symmetrical function and
properly weights the stricter requirement of consistency. We have
illustrated our general results by means of the analysis of a classical,
minimal scenario defined by the binary symmetric channel. The
approach taken here should be considered as the formally appropriate
framework to study the evolution of communication among
embodied agents, where the presence of consistency is inevitable
due to shared perception constraints. Moreover, it might also be
useful as a consistent mathematical framework to deal with
cognitive-based models of brain-language evolution [26-28]. At this point,
we should point out an important issue: consistency of the
communicative exchange is here evaluated between agents, not
internally to a given agent talking to itself. Actually, there is a priori no
correlation between the coding and the decoding modules of a given
agent. In doing so, we take the viewpoint proposed by [14] and [15].
Other approaches assumed an explicit link between the coding and
decoding modules of the agent, thereby avoiding from the beginning
the paradoxical situation in which two agents perfectly understand
each other but, at the same time, are not able to understand
themselves [10, 11]. However, as shown in [29], this situation is unlikely
to occur under selective pressures, for the frameworks depicted by
these earlier works. In the Methods section it is shown that the proposed
framework also has the same property, i.e., that the maximisation of
consistent communication in a given community of agents leads to the
self-consistency of each of them, without the need to impose it
externally, thereby simplifying the mathematical apparatus.

The framework we have developed is somewhat inspired by
Saussure's duality of the sign: a (linguistic) sign is a twofold entity
composed of a signifier and a signified. However, it must be mentioned
that there is a substantial difference between the theory we have
developed and a Saussurean approach. According to Saussure, the
relation between a signifier and a signified is fixed with respect to the
linguistic community that uses the sign. "The masses have no voice in
the matter, and the signifier chosen by language could be replaced by
no other". Saussure therefore adopts a 'static' approach to the study of
signs, whereas we adopt a dynamic perspective that allows us to
address the possibility that different agents assign different meanings
to the same symbol, in which case referentiality is not preserved. In
this way we extend evolutionary game-theoretic arguments in order
to derive a measure of consistency of the shared information between
agents by incorporating the (non-)preservation of referentiality.

In the present work we took the simplest possible form of meaning, namely,
its referential object. However, we said nothing about the object itself.
Further work might explore the inclusion in the proposed framework of an
explicit quantification of meaning beyond its referential value, in order
to rank events of the world and to refine the role of the information
functional in evaluating proper communication exchanges in selective
scenarios. In addition, new hallmarks beyond the agent-channel-agent
schema should be explored, leading to new forms of information which play
a role in biological organisation and which are poorly reflected in such a
schema.

Methods 

Definitions. Information theoretic functionals. The following definitions are intended
to be minimal. We refer the interested reader to any standard textbook on
information theory, such as [23] or [24].

.-Given a random variable $X_\Omega$ taking values over the set $\Omega$ following a
probability distribution $p$,

$$H(X_\Omega) = -\sum_{i \le n} p(m_i) \log p(m_i) \tag{31}$$

is the standard Shannon or statistical entropy.

.-Given two random variables, $X_\Omega$ and $X'_\Omega$,

$$H(X_\Omega | X'_\Omega) = -\sum_{i \le n} p'(m_i) \sum_{j \le n} P(m_j | m_i) \log P(m_j | m_i) \tag{32}$$

is the conditional entropy of $X_\Omega$ with respect to $X'_\Omega$, being, in that case,
$P(m_j | m_i) = P(X_\Omega = m_j | X'_\Omega = m_i)$ and $p'$ the probability distribution of
$X'_\Omega$. Additionally,

$$H(X_\Omega, X'_\Omega) = -\sum_{i \le n} \sum_{j \le n} P(m_i, m_j) \log P(m_i, m_j) \tag{33}$$

where $P(m_i, m_j) = P(X_\Omega = m_i, X'_\Omega = m_j)$, is the joint entropy of the two
random variables $X_\Omega, X'_\Omega$.
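The three functionals above can be evaluated directly from a joint probability matrix. The following is a minimal sketch, not part of the original methods: the function names, the base-2 logarithm and the example joint distribution are assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def joint_entropy(joint):
    """Joint entropy of a matrix joint[i][j] = P(X = m_i, X' = m_j)."""
    return -sum(x * math.log2(x) for row in joint for x in row if x > 0)

def conditional_entropy(joint):
    """H(X | X'), weighting each column by the marginal P(X' = m_j)."""
    total = 0.0
    for j in range(len(joint[0])):
        p_col = sum(row[j] for row in joint)  # marginal P(X' = m_j)
        for row in joint:
            if row[j] > 0:
                total -= row[j] * math.log2(row[j] / p_col)
    return total

# Hypothetical joint distribution over a world with two events.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_xprime = [sum(row[j] for row in joint) for j in range(2)]
h_joint = joint_entropy(joint)
h_cond = conditional_entropy(joint)
chain_rule_gap = abs(h_joint - (entropy(p_xprime) + h_cond))
```

Here `chain_rule_gap` measures how far the computed values are from the chain rule H(X, X') = H(X') + H(X | X'), which should hold up to rounding error.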

.-Given two probability distributions $\pi_1, \pi_2$ defined over the set $\Omega$, the
Kullback-Leibler divergence or relative entropy of $\pi_1$ with respect to $\pi_2$ is:

$$D(\pi_1 \| \pi_2) = \sum_{i \le n} \pi_1(x_i) \log \frac{\pi_1(x_i)}{\pi_2(x_i)}, \tag{34}$$

which is the amount of extra information we need to describe $\pi_1$ taking as the
reference distribution $\pi_2$.
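As an illustration of equation (34), the sketch below computes the divergence for two hypothetical distributions over four events. The function name, the base-2 logarithm and the example values are assumptions; the convention of returning infinity when the reference distribution vanishes on the support of the first one is the usual one, not something stated in the text.

```python
import math

def kl_divergence(pi1, pi2):
    """D(pi1 || pi2) in bits; infinite if pi2 is zero where pi1 is not."""
    total = 0.0
    for a, b in zip(pi1, pi2):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log2(a / b)
    return total

# Hypothetical distributions over a world with four events.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]
d_forward = kl_divergence(peaked, uniform)   # describe 'peaked' using 'uniform'
d_backward = kl_divergence(uniform, peaked)  # describe 'uniform' using 'peaked'
```

The two directions generally differ, which is why the text speaks of the divergence of one distribution with respect to the other rather than of a distance.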

.-Fano's inequality. The probability of error in decoding, $p_e$, satisfies the
following inequality:

$$p_e \ge \frac{H(X_\Omega | X'_\Omega) - 1}{\log(n - 1)}. \tag{35}$$
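A quick numerical check of Fano's inequality can be sketched as follows. The decoder used here (guess the most likely input for each received event), the four-event symmetric channel and the base-2 logarithm are all assumptions chosen for illustration; the bound requires n > 2 so that log(n - 1) is positive.

```python
import math

def conditional_entropy(joint):
    """H(X | X') in bits from joint[i][j] = P(X = m_i, X' = m_j)."""
    total = 0.0
    for j in range(len(joint[0])):
        p_col = sum(row[j] for row in joint)
        for row in joint:
            if row[j] > 0:
                total -= row[j] * math.log2(row[j] / p_col)
    return total

def fano_lower_bound(joint):
    """Right-hand side of Fano's inequality for an n-event world (n > 2)."""
    n = len(joint)
    return (conditional_entropy(joint) - 1.0) / math.log2(n - 1)

def map_error_probability(joint):
    """Error probability when each received event is decoded as its most likely input."""
    n_cols = len(joint[0])
    return 1.0 - sum(max(row[j] for row in joint) for j in range(n_cols))

# Hypothetical symmetric channel: uniform input, correct decoding with probability 0.4.
n = 4
joint = [[(0.4 if i == j else 0.2) / n for j in range(n)] for i in range(n)]
p_error = map_error_probability(joint)
bound = fano_lower_bound(joint)
```

For this example the error probability of the decoder sits slightly above the Fano bound, as the inequality requires.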



.-A Bernoulli process is a stochastic process described by a random variable $X$
taking values in the set $A = \{0, 1\}$, being $p(0) = 1 - \epsilon$ and $p(1) = \epsilon$, where
$\epsilon$ is the parameter of the Bernoulli process. Its entropy $H(X)$ is commonly
referred to as $H(\epsilon)$, since it only depends on this parameter:

$$H(\epsilon) = -(1 - \epsilon) \log(1 - \epsilon) - \epsilon \log \epsilon. \tag{36}$$
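Permutation matrices. A permutation matrix is a square matrix which has exactly one
entry equal to 1 in each row and each column and 0's elsewhere. For example, if $n = 3$,
we have 6 permutation matrices, namely:

$$\begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix},
\begin{pmatrix} 1&0&0\\ 0&0&1\\ 0&1&0 \end{pmatrix},
\begin{pmatrix} 0&1&0\\ 1&0&0\\ 0&0&1 \end{pmatrix},
\begin{pmatrix} 0&1&0\\ 0&0&1\\ 1&0&0 \end{pmatrix},
\begin{pmatrix} 0&0&1\\ 1&0&0\\ 0&1&0 \end{pmatrix},
\begin{pmatrix} 0&0&1\\ 0&1&0\\ 1&0&0 \end{pmatrix}. \tag{37}$$

The set of $n \times n$ permutation matrices is indicated as $\Pi_{n \times n}$ and it can be
shown that, if $A \in \Pi_{n \times n}$, $A^{-1} = A^T \in \Pi_{n \times n}$ and, if
$A, B \in \Pi_{n \times n}$, the product $AB \in \Pi_{n \times n}$. Furthermore, it is clear that
$\delta_{n \times n} \in \Pi_{n \times n}$, being $\delta$ the identity matrix or Kronecker
symbol, defined as $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise.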
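The stated properties of permutation matrices are easy to verify by enumeration for small n. The sketch below does so for n = 3 in plain Python; the helper names and the tuple-of-tuples representation are illustrative choices, not notation from the text.

```python
from itertools import permutations

def permutation_matrices(n):
    """All n x n permutation matrices, each as a tuple of row tuples."""
    return [
        tuple(tuple(1 if col == image else 0 for col in range(n)) for image in perm)
        for perm in permutations(range(n))
    ]

def matmul(a, b):
    """Product of two square matrices given as tuples of row tuples."""
    n = len(a)
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n))
        for i in range(n)
    )

def transpose(a):
    """Transpose of a matrix given as a tuple of row tuples."""
    return tuple(zip(*a))

n = 3
mats = permutation_matrices(n)
identity = tuple(tuple(1 if i == j else 0 for j in range(n)) for i in range(n))

# The inverse of a permutation matrix is its transpose.
inverse_is_transpose = all(matmul(a, transpose(a)) == identity for a in mats)
# The product of two permutation matrices is again a permutation matrix.
closed_under_product = all(matmul(a, b) in mats for a in mats for b in mats)
```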

Inequalities. We present the inequalities described in the main text in terms of three
lemmas on the upper bounds of $\mathcal{I}(AB)$. The first one concerns inequality (25). The
second one is general and supports the third, which proves inequality (27):

Lemma 1.- Let A, B be two agents sharing the world $\Omega$. The amount of consistent
information transmitted from A to B (when A acts as the coder agent and B as the
decoder one) satisfies

$$\mathcal{I}(AB) = I(X_\Omega : X'_\Omega) \tag{38}$$

only in the following two extreme cases:

1. $I(X_\Omega : X'_\Omega) = 0$, or

2. $\Lambda(AB) = \delta_{n \times n}$.

Otherwise, $\mathcal{I}(AB) < I(X_\Omega : X'_\Omega)$.

Proof.- The first case is the trivial one in which there is no information available due
to total uncertainty, corresponding to $\epsilon = 1/2$ in the case of the symmetric binary
channel studied above, see also figure (3). The second one is more interesting. Indeed,
having $\Lambda(AB) = \delta$ means that

$$(P^A, \Lambda, Q^B \in \Pi_{n \times n}) \quad \text{and} \quad P^A = (\Lambda Q^B)^T, \tag{39}$$

where we use that, if $C \in \Pi_{n \times n}$, $C^{-1} = C^T$, also having that
$C^T \in \Pi_{n \times n}$. Out of these two situations, $\exists J_{ik}(AB) > 0$ in which
$i \neq k$, since there are more than $n$ non-zero entries in the matrix $\Lambda(AB)$,
leading to

$$\mathcal{I}(AB) < I(X_\Omega : X'_\Omega). \tag{40}$$
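The three regimes of Lemma 1 can be illustrated on a two-event world. The sketch below assumes that the consistent information has the form used in the proof of Lemma 2, namely the sum over i of J_ii(AB) D(Λ_i(AB)||p'), with J_ii(AB) = p(m_i) Λ_ii(AB) and Λ_i(AB) the i-th row of the transition matrix. The function names, the base-2 logarithm and the example matrices are assumptions for illustration only.

```python
import math

def kl_divergence(pi1, pi2):
    """D(pi1 || pi2) in bits, assuming pi2 > 0 wherever pi1 > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(pi1, pi2) if a > 0)

def output_distribution(p, lam):
    """Distribution p' of the decoded events for input p and transition matrix lam."""
    n = len(p)
    return [sum(p[i] * lam[i][j] for i in range(n)) for j in range(n)]

def mutual_information(p, lam):
    """I(X : X') written as the p-weighted average of row divergences."""
    p_out = output_distribution(p, lam)
    return sum(p[i] * kl_divergence(lam[i], p_out) for i in range(len(p)))

def consistent_information(p, lam):
    """Same average, but weighting each row only by its diagonal joint probability."""
    p_out = output_distribution(p, lam)
    return sum(p[i] * lam[i][i] * kl_divergence(lam[i], p_out) for i in range(len(p)))

p = [0.5, 0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]  # every event decoded as itself
swap = [[0.0, 1.0], [1.0, 0.0]]      # every event decoded as the other one
noisy = [[0.9, 0.1], [0.1, 0.9]]     # mostly correct decoding
```

With these inputs the identity matrix gives equal values for both functionals, the swapped matrix keeps the mutual information maximal while the consistent information vanishes, and the noisy matrix falls strictly in between.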



Lemma 2.- Let A, B be two agents sharing the world $\Omega$. The amount of consistent
information transmitted from A to B (when A acts as the coder agent and B as the
decoder one) is bounded as follows:

$$\mathcal{I}(AB) \le \left(1 - \frac{H(X_\Omega | X'_\Omega) - 1}{\log(n - 1)}\right) \max_i \{D(\Lambda_i(AB) \| p')\}. \tag{41}$$



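Proof.- Let $v$ and $u$ be two vectors of $\mathbb{R}^n$. Their scalar product,
$\langle v, u \rangle = \sum_{i \le n} v_i u_i$, is bounded, thanks to Hölder's inequality,
in the following way:

$$|\langle v, u \rangle| \le \left(\sum_{i \le n} |v_i|^\alpha\right)^{1/\alpha} \left(\sum_{i \le n} |u_i|^\beta\right)^{1/\beta} \tag{42}$$

as long as $\alpha$ and $\beta$ are Hölder conjugates, i.e., $1/\alpha + 1/\beta = 1$. The above
expression can be rewritten, using the notation of norms, as
$|\langle v, u \rangle| \le \|v\|_\alpha \|u\|_\beta$; recall that, for $\alpha = \beta = 2$ (that is,
$1/\alpha = 1/\beta = 1/2$), we recover the well-known Schwarz inequality for the euclidean
distance. If we put $\alpha \to 1$ and $\beta \to \infty$ we obtain

$$|\langle v, u \rangle| \le \|v\|_1 \|u\|_\infty, \tag{43}$$

where

$$\|v\|_1 = \sum_{i \le n} |v_i| \quad \text{and} \quad \|u\|_\infty = \max_i \{|u_i|\}, \tag{44}$$

the last one being the so-called Chebyshev norm. Now we want to apply this
machinery to our problem. The key point is to realize that $\mathcal{I}(AB)$ can be expressed as
a scalar product between two vectors, the first one having coordinates
$J_{11}(AB), \ldots, J_{nn}(AB)$ and the second one
$D(\Lambda_1(AB) \| p'), \ldots, D(\Lambda_n(AB) \| p')$. We remark that this step is legitimate
because all the terms involved in the computation are non-negative. Therefore, by
applying Hölder's inequality to the definition of $\mathcal{I}(AB)$, we have that

$$\begin{aligned}
\mathcal{I}(AB) &= \sum_{i \le n} J_{ii}(AB) \, D(\Lambda_i(AB) \| p') \\
&\le \left(\sum_{i \le n} J_{ii}(AB)\right) \max_i \{D(\Lambda_i(AB) \| p')\} \\
&= \theta_{AB} \max_i \{D(\Lambda_i(AB) \| p')\},
\end{aligned} \tag{45}$$

being $\theta_{AB}$ defined in equation (11). Now we observe that the probability of error in
referentiating a given event of $\Omega$ is $p_e = 1 - \theta_{AB}$. This enables us to use Fano's
inequality to bound $\theta_{AB}$:

$$\theta_{AB} \le 1 - \frac{H(X_\Omega | X'_\Omega) - 1}{\log(n - 1)}, \tag{46}$$

thereby obtaining the desired result.

Lemma 3.- (Derivation of inequality (27)). Let A, B be two agents sharing the world
$\Omega$ and such that $(\forall m_i \in \Omega) \; p(m_i) = 1/n$ and that the channel defined by
$\Lambda(AB)$ is symmetric. Then, the following inequality holds:

$$\mathcal{I}(AB) \le \frac{I^2(X_\Omega : X'_\Omega)}{H(X_\Omega)} + 1. \tag{47}$$

Proof.- The first issue is to show that, if $(\forall m_i \in \Omega) \; p(m_i) = 1/n$ and the
channel defined by $\Lambda(AB)$ is symmetric, then
$(\forall m_i \in \Omega) \; D(\Lambda_i(AB) \| p') = I(X_\Omega : X'_\Omega)$. Indeed, since the
channel is symmetric, $p = p'$ and thus $H(X_\Omega) = H(X'_\Omega) = \log n$. Then take any
$m_i \in \Omega$ and compute $D(\Lambda_i(AB) \| p')$:

$$\begin{aligned}
D(\Lambda_i(AB) \| p') &= \sum_{j \le n} \Lambda_{ij}(AB) \log \Lambda_{ij}(AB) + \log n \\
&= \log n - H(X'_\Omega | X_\Omega = m_i) \\
&= \log n - \sum_{k \le n} \frac{1}{n} H(X'_\Omega | X_\Omega = m_k) \\
&= I(X_\Omega : X'_\Omega),
\end{aligned} \tag{48}$$

where in the third step we used the property that, in a symmetric channel,
$(\forall m_i, m_j \in \Omega) \; H(X'_\Omega | X_\Omega = m_i) = H(X'_\Omega | X_\Omega = m_j)$. Thus, if
we average a constant value, we obtain such a value as the outcome (last step). Then,
we apply inequality (41):

$$\begin{aligned}
\mathcal{I}(AB) &\le \left(1 - \frac{H(X_\Omega | X'_\Omega) - 1}{\log(n - 1)}\right) I(X_\Omega : X'_\Omega) \\
&\le \left(1 - \frac{H(X_\Omega | X'_\Omega) - 1}{H(X_\Omega)}\right) I(X_\Omega : X'_\Omega) \\
&= \frac{I^2(X_\Omega : X'_\Omega)}{H(X_\Omega)} + \frac{I(X_\Omega : X'_\Omega)}{H(X_\Omega)} \\
&\le \frac{I^2(X_\Omega : X'_\Omega)}{H(X_\Omega)} + 1,
\end{aligned} \tag{49}$$

where in the second step we used the fact that $H(X_\Omega) = \log n > \log(n - 1)$, in the
equality we expanded the product using
$H(X_\Omega) - H(X_\Omega | X'_\Omega) = I(X_\Omega : X'_\Omega)$, and in the last step we bound the
remaining term

$$\frac{I(X_\Omega : X'_\Omega)}{H(X_\Omega)} \le 1, \tag{50}$$

since $I(X_\Omega : X'_\Omega) \le H(X_\Omega)$, thus completing the proof.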
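The chain of bounds in Lemmas 2 and 3 can be checked numerically on a small symmetric channel. The sketch below is only an illustration: the four-event world, the transition probabilities, the base-2 logarithm and all function and variable names are assumptions, and the consistent information is computed as the sum over i of J_ii(AB) D(Λ_i(AB)||p') with J_ij(AB) = p(m_i) Λ_ij(AB).

```python
import math

def kl_divergence(pi1, pi2):
    """D(pi1 || pi2) in bits, assuming pi2 > 0 wherever pi1 > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(pi1, pi2) if a > 0)

def channel_summary(p, lam):
    """Quantities entering the lemmas for input p and transition matrix lam."""
    n = len(p)
    joint = [[p[i] * lam[i][j] for j in range(n)] for i in range(n)]
    p_out = [sum(joint[i][j] for i in range(n)) for j in range(n)]
    divergences = [kl_divergence(lam[i], p_out) for i in range(n)]
    h_cond = -sum(
        joint[i][j] * math.log2(joint[i][j] / p_out[j])
        for i in range(n) for j in range(n) if joint[i][j] > 0
    )
    return {
        "theta": sum(joint[i][i] for i in range(n)),          # referential success
        "consistent": sum(joint[i][i] * divergences[i] for i in range(n)),
        "mutual": sum(p[i] * divergences[i] for i in range(n)),
        "h_cond": h_cond,                                      # H(X | X')
        "h_in": -sum(x * math.log2(x) for x in p if x > 0),    # H(X)
        "divergences": divergences,
    }

# Hypothetical symmetric channel over four equiprobable events.
n = 4
p = [1.0 / n] * n
lam = [[0.7 if i == j else 0.1 for j in range(n)] for i in range(n)]
s = channel_summary(p, lam)

theta_bound = 1.0 - (s["h_cond"] - 1.0) / math.log2(n - 1)   # Fano bound on theta
lemma2_bound = theta_bound * max(s["divergences"])           # bound of Lemma 2
lemma3_bound = s["mutual"] ** 2 / s["h_in"] + 1.0            # bound of Lemma 3
```

In this example every row divergence coincides with the mutual information, as the first part of the proof of Lemma 3 states for uniform inputs and symmetric channels, and the consistent information lies below both bounds.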

Achieving self-consistency maximizing consistent information. The structure of
the functional accounting for the amount of consistent information shared by two
agents, equation (21), can lead to the paradoxical situation in which high scores on
$\mathcal{I}(A : B)$ do not imply high values of $\mathcal{I}(A : A)$ or $\mathcal{I}(B : B)$. In brief, the
degeneracy of possible optimal configurations seems to jeopardize self-understanding
even in the case in which communication is optimal. Interestingly, this apparent
paradox can be ruled out if we consider a population of agents, for several
representative cases, as demonstrated in 29 using a version of $\theta_{AB}$. For the
particular case where n AB = 0, we have seen at the beginning of this section that
$\mathcal{I}(AB) \le I(X_\Omega : X'_\Omega)$, having the equality only in the special case in which
$\Lambda(AB) = \delta_{n \times n}$ which, in turn, implies that $\mathcal{I}(AB) = H(X_\Omega)$. The
interesting issue is that in the presence of three or more agents A, B and C:

$$\left.\begin{array}{l}
\mathcal{I}(A : B) = H(X_\Omega) \\
\mathcal{I}(A : C) = H(X_\Omega) \\
\mathcal{I}(B : C) = H(X_\Omega)
\end{array}\right\} \Rightarrow \left\{\begin{array}{l}
\mathcal{I}(A : A) = H(X_\Omega) \\
\mathcal{I}(B : B) = H(X_\Omega) \\
\mathcal{I}(C : C) = H(X_\Omega)
\end{array}\right. \tag{51}$$

i.e., maximizing the communicative success over a population of agents results
automatically in a population of self-consistent agents, although there is no a priori
correlation between the coder and the decoder module of a given agent. Now we
rigorously demonstrate this statement.

Lemma 4.- Let $A_i, A_j, A_k$ be three agents communicatively interacting and
sharing the world $\Omega$. Then, if $(\forall i < k) \; \mathcal{I}(A_i : A_k) = H(X_\Omega)$, then
$(\forall i) \; \mathcal{I}(A_i : A_i) = H(X_\Omega)$.

Proof.- We observe, as discussed above, that the premise only holds if $(\forall i < k)$

$$P^{A_i}, Q^{A_i}, \Lambda, P^{A_k}, Q^{A_k} \in \Pi_{n \times n}, \tag{52}$$

and

$$P^{A_i} = (\Lambda Q^{A_k})^T \; \wedge \; P^{A_k} = (\Lambda Q^{A_i})^T. \tag{53}$$

Now we observe that, if $\mathcal{I}(A_i : A_k) = H(X_\Omega)$ and $\mathcal{I}(A_i : A_j) = H(X_\Omega)$, we
conclude that:

$$Q^{A_k} = Q^{A_j} \; \wedge \; P^{A_k} = P^{A_j}, \tag{54}$$

i.e., $A_k = A_j$. Now, knowing that $\mathcal{I}(A_k : A_j) = H(X_\Omega)$, then:

$$\mathcal{I}(A_k : A_k) = H(X_\Omega). \tag{55}$$

We can easily generalize this reasoning to an arbitrarily large number of
communicating agents.
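The role played by the third agent can be made concrete by exhaustive search over a small world. The sketch below rests on several assumptions that go beyond the text: the channel is taken to be noiseless (Λ equal to the identity), each agent is modelled as a pair of permutation matrices (coder P, decoder Q), optimal communication from one agent to another is taken to mean that the product of the coder of the first and the decoder of the second is the identity, and permutation matrices are encoded as tuples giving the column of the 1 in each row. All names are hypothetical.

```python
from itertools import permutations, product

N = 3
PERMS = list(permutations(range(N)))
IDENTITY = tuple(range(N))

def compose(a, b):
    """Encoding of the matrix product A B, where row i of A has its 1 in column a[i]."""
    return tuple(b[a[i]] for i in range(N))

def understands(coder, decoder):
    """True when the coder's P followed by the decoder's Q is the identity."""
    return compose(coder[0], decoder[1]) == IDENTITY

def mutually_consistent(group):
    """Every agent is perfectly understood by every other agent of the group."""
    size = len(group)
    return all(
        understands(group[i], group[k])
        for i in range(size) for k in range(size) if i != k
    )

def self_consistent(group):
    """Every agent of the group understands itself."""
    return all(understands(agent, agent) for agent in group)

AGENTS = list(product(PERMS, PERMS))  # each agent is a pair (P, Q)

# Pairs that communicate optimally although some agent does not understand itself.
paradoxical_pairs = [
    g for g in product(AGENTS, repeat=2)
    if mutually_consistent(g) and not self_consistent(g)
]
# The same search over triples of agents.
paradoxical_triples = [
    g for g in product(AGENTS, repeat=3)
    if mutually_consistent(g) and not self_consistent(g)
]
```

Under these assumptions the search finds paradoxical pairs but no paradoxical triple, in line with the statement that optimal communication among three or more agents forces each of them to be self-consistent.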